Take-home_Ex01

Author

Wan Xinyu

1. The task

The purpose of this take home exercise is to to reveal the demographic and financial characteristics of the city of Engagement, using appropriate static and interactive statistical graphics methods. This exercise requires a user-friendly and interactive solution that helps city managers and planners to explore the complex data in an engaging way and reveal hidden patterns.

2. Data preparation

2.1 Installing the data packages

pacman::p_load(ggplot2, ggiraph, plotly, 
               patchwork, DT, tidyverse,
               ggrepel, ggthemes, hrbrthemes,
               tidyverse,ggstatsplot,pals,readxl, 
               performance, parameters, see) 

ggplot2: A data visualization package for creating graphics in R. It provides a flexible and powerful framework for creating elegant and complex graphs. Documentation: https://ggplot2.tidyverse.org/

ggiraph: An extension to ggplot2 that allows you to add interactivity to your ggplot2 graphics. It provides a set of functions that enable you to create interactive tooltips and clickable regions on your plots. Documentation: https://davidgohel.github.io/ggiraph/index.html

plotly: A data visualization package that allows you to create interactive, web-based plots in R. It provides a wide range of chart types, including scatter plots, line charts, and heatmaps. Documentation: https://plotly.com/r/

patchwork: A package for combining multiple ggplots into a single plot. It provides a flexible and powerful framework for arranging, annotating, and styling your plots. Documentation: https://patchwork.data-imaginist.com/

DT: A package for creating interactive data tables in R. It provides a wide range of features, including sorting, filtering, and pagination. Documentation: https://rstudio.github.io/DT/

tidyverse: A collection of packages for data manipulation and visualization in R. It provides a consistent set of tools for working with data, including data cleaning, transformation, and visualization. Documentation: https://www.tidyverse.org/

ggrepel: A package for avoiding overplotting of text labels in ggplot2. It provides a set of functions for positioning labels in a way that minimizes overlap and maximizes readability. Documentation: https://ggrepel.slowkow.com/

ggthemes: A package for creating visually appealing themes for ggplot2 graphics. It provides a set of pre-designed themes that can be used to customize the appearance of your plots. Documentation: https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/

hrbrthemes: A package for creating visually appealing themes for ggplot2 graphics. It provides a set of pre-designed themes that are designed to be aesthetically pleasing and easy to read. Documentation: https://hrbrmstr.github.io/hrbrthemes/

ggstatsplot: A package for creating publication-ready plots with statistical analysis built-in. It provides a set of functions for creating common statistical plots, including scatterplots, boxplots, and histograms, with statistical tests and summary information included. Documentation: https://indrajeetpatil.github.io/ggstatsplot/

pals: A package for creating color palettes for data visualization. It provides a set of functions for generating color palettes based on various themes and color schemes. Documentation: https://cran.r-project.org/web/packages/pals/vignettes/pals_examples.html

readxl: A package for reading Excel files in R. It provides a set of functions for importing data from Excel spreadsheets into R data frames. Documentation: https://readxl.tidyverse.org/

performance: A package for evaluating the performance of predictive models in R. It provides a set of functions for creating various types of performance metrics and visualizations. Documentation: https://easystats.github.io/performance/

parameters: A package for managing parameters and arguments in R. It provides a set of functions for defining, validating, and documenting parameters in R functions. Documentation: https://www.rdocumentation.org/packages/parameters/versions/0.21.0

see: A package for visualizing the output of R functions. It provides a set of functions for creating visual summaries of data frames, matrices, and other R objects. Documentation: https://cran.r-project.org/web/packages/see/index.html

2.2 Dataset introduction

The unzipped files have been saved into a new folder named data for better organization.

part_info <- read_csv("data/Participants.csv")

Participants.csv

Contains information about the residents of City of Engagement that have agreed to participate in this study.

  • participantId (integer): unique ID assigned to each participant.
  • householdSize (integer): the number of people in the participant’s household
  • haveKids (boolean): whether there are children living in the participant’s household.
  • age (integer): participant’s age in years at the start of the study.
  • educationLevel (string factor): the participant’s education level, one of: {“Low”, “HighSchoolOrCollege”, “Bachelors”, “Graduate”}
  • interestGroup (char): a char representing the participant’s stated primary interest group, one of {“A”, “B”, “C”, “D”, “E”, “F”, “G”, “H”, “I”, “J”}. Note: specific topics of interest have been redacted to avoid bias.
  • joviality (float): a value ranging from [0,1] indicating the participant’s overall happiness level at the start of the study.
finance <- read_csv("data/FinancialJournal.csv")

FinancialJournal.csv

Contains information about financial transactions.

  • participantId (integer): unique ID corresponding to the participant affected
  • timestamp (datetime): the time when the check-in was logged
  • category (string factor): a string describing the expense category, one of {“Education”, “Food”, “Recreation”, “RentAdjustment”, “Shelter”, “Wage”}
  • amount (double): the amount of the transaction For explanation of Rent Adjustment, please refer to this link

2.3 Examining the datasets

Participant csv

Lets first examine the properties of the participants csv file.

datatable(part_info)
str(part_info)
spc_tbl_ [1,011 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ participantId : num [1:1011] 0 1 2 3 4 5 6 7 8 9 ...
 $ householdSize : num [1:1011] 3 3 3 3 3 3 3 3 3 3 ...
 $ haveKids      : logi [1:1011] TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ age           : num [1:1011] 36 25 35 21 43 32 26 27 20 35 ...
 $ educationLevel: chr [1:1011] "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" ...
 $ interestGroup : chr [1:1011] "H" "B" "A" "I" ...
 $ joviality     : num [1:1011] 0.00163 0.32809 0.39347 0.13806 0.8574 ...
 - attr(*, "spec")=
  .. cols(
  ..   participantId = col_double(),
  ..   householdSize = col_double(),
  ..   haveKids = col_logical(),
  ..   age = col_double(),
  ..   educationLevel = col_character(),
  ..   interestGroup = col_character(),
  ..   joviality = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Removed participant Id, education level and interestgroup from the following since summary are not meaningful

summary(part_info[, c("householdSize", "haveKids", "age","joviality")])
 householdSize    haveKids            age          joviality       
 Min.   :1.000   Mode :logical   Min.   :18.00   Min.   :0.000204  
 1st Qu.:1.000   FALSE:710       1st Qu.:29.00   1st Qu.:0.240074  
 Median :2.000   TRUE :301       Median :39.00   Median :0.477539  
 Mean   :1.964                   Mean   :39.07   Mean   :0.493794  
 3rd Qu.:3.000                   3rd Qu.:50.00   3rd Qu.:0.746819  
 Max.   :3.000                   Max.   :60.00   Max.   :0.999234  

Count of resident for each interest group

table(part_info$interestGroup)

  A   B   C   D   E   F   G   H   I   J 
102  91 102  96  83 106 108 111  96 116 

Count of resident for each education level

table(part_info$educationLevel)

          Bachelors            Graduate HighSchoolOrCollege                 Low 
                232                 170                 525                  84 
sum(is.na(part_info))
[1] 0
duplicates1 <- duplicated(part_info)
part_info[duplicates1, ]
# A tibble: 0 × 7
# ℹ 7 variables: participantId <dbl>, householdSize <dbl>, haveKids <lgl>,
#   age <dbl>, educationLevel <chr>, interestGroup <chr>, joviality <dbl>

Since 0 rows are displayed. We can confirm that duplicates do not exist in this csv file.

Finance csv

Lets move on to examine the properties of the finance csv file.

Sneak peak of the first few entries in the dataset

head(finance)
# A tibble: 6 × 4
  participantId timestamp           category  amount
          <dbl> <dttm>              <chr>      <dbl>
1             0 2022-03-01 00:00:00 Wage      2473. 
2             0 2022-03-01 00:00:00 Shelter   -555. 
3             0 2022-03-01 00:00:00 Education  -38.0
4             1 2022-03-01 00:00:00 Wage      2047. 
5             1 2022-03-01 00:00:00 Shelter   -555. 
6             1 2022-03-01 00:00:00 Education  -38.0
Warning

Please do not use datatable here or you will face a error of ‘Fatal javascript OOM in reached Heap Limit’ which indicate that R studio session has run out of memory while attempting to execute JavaScript code.

str(finance)
spc_tbl_ [1,513,636 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ participantId: num [1:1513636] 0 0 0 1 1 1 2 2 2 3 ...
 $ timestamp    : POSIXct[1:1513636], format: "2022-03-01" "2022-03-01" ...
 $ category     : chr [1:1513636] "Wage" "Shelter" "Education" "Wage" ...
 $ amount       : num [1:1513636] 2473 -555 -38 2047 -555 ...
 - attr(*, "spec")=
  .. cols(
  ..   participantId = col_double(),
  ..   timestamp = col_datetime(format = ""),
  ..   category = col_character(),
  ..   amount = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Only ‘amount’ variable summary is displayed since summary of other variables are not useful

summary(finance[c("amount")])
     amount         
 Min.   :-1562.726  
 1st Qu.:   -5.594  
 Median :   -4.000  
 Mean   :   20.047  
 3rd Qu.:   21.598  
 Max.   : 4096.526  
table(finance$category)

     Education           Food     Recreation RentAdjustment        Shelter 
          3319         790051         296013            131          11463 
          Wage 
        412659 
sum(is.na(finance))
[1] 0

There is ni missing values

Check for outlier in the amount variable. We first group the amount variables by the category. Then we do a box plot. From the chart we can observe that shelter has some abnormally small values to the negative end and wages has some exceptionally large values on the positive end. We may wish to take note of these in our analysis.

Show the code
# Create a box plot of amount by category
ggplotly(ggplot(finance, aes(x = category, y = amount, fill = category)) +
  geom_boxplot() +
  xlab("Expense Category") +
  ylab("Amount") +
  ggtitle("Amount Spent by Expense Category"))

Now lets move to check for duplicates. The code below will display the duplicates in the financial_journal.csv

duplicates <- duplicated(finance)
finance[duplicates, ]
# A tibble: 1,113 × 4
   participantId timestamp           category   amount
           <dbl> <dttm>              <chr>       <dbl>
 1             0 2022-03-01 00:00:00 Shelter    -555. 
 2             0 2022-03-01 00:00:00 Education   -38.0
 3             1 2022-03-01 00:00:00 Shelter    -555. 
 4             1 2022-03-01 00:00:00 Education   -38.0
 5             2 2022-03-01 00:00:00 Shelter    -557. 
 6             2 2022-03-01 00:00:00 Education   -12.8
 7             3 2022-03-01 00:00:00 Shelter    -555. 
 8             3 2022-03-01 00:00:00 Education   -38.0
 9             4 2022-03-01 00:00:00 Shelter   -1556. 
10             4 2022-03-01 00:00:00 Education   -12.8
# ℹ 1,103 more rows

We will remove these duplicates and resave the file as finance using the following code

finance <- distinct(finance)

Lets just confirm that the duplicates are removed using the following code. The output of the following should be the same.

nrow(finance)
[1] 1512523
nrow(distinct(finance))
[1] 1512523
Warning

For better data presentation and consistency, we may encode all expenses into positive instead of negative using the code below. However, we will not do this as we will will analysis rent adjustments which has positive and negative and may cause confusion

#finance$amount <- abs(finance$amount)

2.4 Checking for anomalies in datasets

Show the code
finance %>%
  mutate(date = as.Date(timestamp)) %>%
  group_by(participantId) %>%
  summarise(num_months = n_distinct(format(date, "%Y-%m"))) %>%
  datatable()
Note

Clicking from the above table, we do realize that most participant id has transaction in all 12 months but some only had transactions in one month. We should remove them as it is likely they are not residents

Using the code below, we built a table and identify residents with irregular transactions

Show the code
finance %>%
  mutate(date = as.Date(timestamp)) %>%
  group_by(participantId) %>%
  summarise(num_months = n_distinct(format(date, "%Y-%m"))) %>%
  filter(num_months != max(num_months)) %>%
  datatable()

We will move on to classify the above identified participants as non residents and remove them from the data sets

Removing non-residents from finance dataset

# calculate num_months for each participant
monthly_counts <- finance %>%
  mutate(date = as.Date(timestamp)) %>%
  group_by(participantId) %>%
  summarize(num_months = n_distinct(format(date, "%Y-%m")))

# find participants with num_months different from the maximum num_months
non_residents <- monthly_counts %>%
  filter(num_months != max(num_months))

# remove non-residents from the finance data frame
finance <- finance %>%
  filter(!participantId %in% non_residents$participantId)

A reduction in rows is observed below indicating removal of 131 residents.

nrow(distinct(finance))
[1] 1509897

Removing non-residents from participant dataset

We will also update out participant csv to remove the non residents

# remove non-residents from the finance data frame
part_info <- part_info %>%
  filter(!participantId %in% non_residents$participantId)

We can see a reduction in 131 rows as well

nrow(distinct(part_info))
[1] 880

3. Data visualization

Lets now visualize the data provided in the data set participants.csv

Show the code
# Histogram of age
v1<- ggplot(part_info, aes(x = age)) +
  geom_bar(binwidth = 5) +
  labs(title = "Distribution of Participant Age",
       x = "Age (years)",
       y = "Count")
ggplotly(v1)
Show the code
# Bar chart of education level
v2<- ggplot(part_info, aes(x = educationLevel)) +
  geom_bar() +
  labs(title = "Education Level of Participants",
       x = "Education Level",
       y = "Count")
ggplotly(v2)
Show the code
# Bar chart of household size
v3<- ggplot(part_info, aes(x = householdSize)) +
  geom_bar() +
  labs(title = "Household Size of Participants",
       x = "Household size",
       y = "Count")
ggplotly(v3)
Show the code
# Bar chart of whether participants have children
v4<- ggplot(part_info, aes(x = factor(haveKids))) +
  geom_bar() +
  labs(title = "Proportion of Participants with Children",
       x = "Have Children",
       y = "Count")
ggplotly(v4)
Show the code
v5 <- ggplot(data = part_info, aes(x = interestGroup)) +
      geom_bar(aes(text = paste("\n","Count: ", ..count.., "\n"," Percentage: ", scales::percent(..count../sum(..count..))))) +
      labs(title = "Interest Group Distribution", x = "Interest Group", y = "Count")

ggplotly(v5,tooltip = "text")
Show the code
v6<- ggplot(part_info, aes(x = joviality)) +
  geom_histogram(binwidth = 0.05, fill = "grey", color = "white") +
  labs(title = "Joviality Distribution", x = "Joviality", y = "Count")
ggplotly(v6)

3.1 Exploring participants data

3.1.1 Education vs age

Show the code
v7<- ggplot(part_info, aes(x = age, fill = educationLevel)) +
  geom_bar() +
  labs(title = "Distribution of Education Level by age",
       x = "Age (years)",
       y = "Education Level")
v8<- ggplot(part_info, aes(x = age)) +
  geom_bar() +
  labs(title = "Age Distribution by Education Level",
       x = "Age (years)",
       y = "Count") +
  facet_wrap(~ educationLevel, ncol = 2)

v7 + v8

From the charts, we can observe that across different age, high school educated residents are of the majority while low educated residents are of the minority.

Note

Added fig.height to make sure that the charts are not overly compressed

3.1.2 Household size vs have kids

Show the code
ggplot(part_info, aes(x = haveKids, y = householdSize)) +
  geom_jitter() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(title = "Relationship between Household Size and Having Children",
       x = "Have Children",
       y = "Household Size")

From the chart, we observe a pattern that only household with 3 person have kids while those with 2 do not. This is useful because for the authorities, provison of subsidies such as milk and diaper vouchers can be directed specifically to family with more than 3 household members

3.1.3 Have kids vs education

Show the code
# We use this line to relabel the education level according to acceding order
edu_levels <- c("Low", "HighSchoolOrCollege", "Bachelors", "Graduate")

# convert education level column to our above specified levels
part_info$educationLevel <- factor(part_info$educationLevel, levels = edu_levels)

# Calculate percentage of each education level group with children
edu_kids <- part_info %>%
  group_by(educationLevel, haveKids) %>%
  summarise(count = n()) %>%
  mutate(percentage = count / sum(count))

# Plot the bar chart
ggplotly(ggplot(edu_kids, aes(x = educationLevel, y = percentage, fill = haveKids)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Percentage of Participants with Children by Education Level",
       x = "Education Level",
       y = "Percentage with Children") +
  scale_fill_manual(values = c("#E69F00", "#56B4E9"), labels = c("No", "Yes")))

From the above we can observe that across all education levels, percentage of residents who have kids are generally lower that than percentage of residents who do not have kids. There seems to be an inverse relationship between education level and if a resident has kids (with exception to high school where it is higher than low education residents).

Note

We specify the desired order of the education levels using the edu_levels vector. Then, we convert the education level column to a factor with the desired levels using the factor() function and the levels

3.1.4 Mean age vs having kids

Show the code
tooltip <- function(y, ymax, accuracy = .01) {
  mean <- scales::number(y, accuracy = accuracy)
  sem <- scales::number(ymax - y, accuracy = accuracy)
  paste("Mean age:", mean, "+/-", sem)
}

gg_point <- ggplot(data=part_info, 
                   aes(x = haveKids),
) +
  stat_summary(aes(y = age, 
                   tooltip = after_stat(  
                     tooltip(y, ymax))),  
    fun.data = "mean_se", 
    geom = GeomInteractiveCol,  
    fill = "light blue"
  ) +
  stat_summary(aes(y = age),
    fun.data = mean_se,
    geom = "errorbar", width = 0.2, size = 0.2
  )

girafe(ggobj = gg_point,
       width_svg = 8,
       height_svg = 8*0.618)

From the chart above, we do not observe any significant difference.However we do note that mean age of residents who do not have kids are sightly lower than those who do have kids.

Let’s perform a statistical test to determine if there is a difference in the mean

Show the code
ggbetweenstats(
  data = part_info,
  x = haveKids,
  y = age,
  plot.type = "box",
  pairwise.comparisons = TRUE,
  mean.plotting = TRUE,
   message = TRUE ,
  xlab = "Having Kids",
  ylab = "Age"
)

At a significance level of 5%, since the p value is greater than 0.05, we cannot reject the null hypothesis that the mean age of resident who have and do not have kids are the same

3.1.5 Percentage of Participants in Each Interest Group by Education Level

Show the code
edu_levels <- c("Low", "HighSchoolOrCollege", "Bachelors", "Graduate")

part_info$educationLevel <- factor(part_info$educationLevel, levels = edu_levels)

grouped_data <- part_info %>%
  group_by(educationLevel, interestGroup) %>%
  summarise(count = n()) %>%
  mutate(percentage = prop.table(count) * 100)

plot <- ggplot(grouped_data, aes(x = educationLevel, y = percentage, fill = interestGroup)) +
  geom_col(position = "dodge") +
  labs(title = "Percentage of Participants in Each Interest Group by Education Level",
       x = "Education Level",
       y = "Percentage") +
  scale_fill_brewer(palette = "Set3") +
  theme()

plotly_plot <- ggplotly(plot)
plotly_plot

From the above chart, we observe that interest group I has exceptionally high percentage of participants from low education level while interest group E has the exceptionally low percentage of participant from low education level residents. For other education levels, interest group participation, percentage of participation across interest groups varies shows less of huge difference. More information may be required to determine the reasons behind.

3.1.6 Mean joviality vs education

Show the code
tooltip <- function(y, ymax, accuracy = .01) {
  mean <- scales::number(y, accuracy = accuracy)
  sem <- scales::number(ymax - y, accuracy = accuracy)
  paste("Mean jovility:", mean, "+/-", sem)
}

gg_point <- ggplot(data=part_info, 
                   aes(x = educationLevel),
) +
  stat_summary(aes(y = joviality, 
                   tooltip = after_stat(  
                     tooltip(y, ymax))),  
    fun.data = "mean_se", 
    geom = GeomInteractiveCol,  
    fill = "light blue"
  ) +
  stat_summary(aes(y = joviality),
    fun.data = mean_se,
    geom = "errorbar", width = 0.2, size = 0.2
  )

girafe(ggobj = gg_point,
       width_svg = 8,
       height_svg = 8*0.618)

We can observe from the above chart that mean joviality increases with increasing education level.

Let’s perform a statistical test to determine if there is a difference in the mean jovialaity

Show the code
ggbetweenstats(
  data = part_info,
  x = educationLevel,
  y = joviality,
  plot.type = "box",
  mean.plotting = TRUE,
  xlab = "Education Level",
  ylab = "Joviality"
)

At a significance level of 5%, since the p value is smaller than 0.05, we reject the null hypothesis that the mean joviality of resident with different education are the same. Therefore, we confirm that mean joviality is different across education levels.

3.1.7 Mean joviality vs household size

Show the code
tooltip <- function(y, ymax, accuracy = .01) {
  mean <- scales::number(y, accuracy = accuracy)
  sem <- scales::number(ymax - y, accuracy = accuracy)
  paste("Mean jovility:", mean, "+/-", sem)
}

gg_point <- ggplot(data=part_info, 
                   aes(x = householdSize),
) +
  stat_summary(aes(y = joviality, 
                   tooltip = after_stat(  
                     tooltip(y, ymax))),  
    fun.data = "mean_se", 
    geom = GeomInteractiveCol,  
    fill = "light blue"
  ) +
  stat_summary(aes(y = joviality),
    fun.data = mean_se,
    geom = "errorbar", width = 0.2, size = 0.2
  )

girafe(ggobj = gg_point,
       width_svg = 8,
       height_svg = 8*0.618)

We observe that mean joviality for single person household is highest display an decreasing trend with increasing number of members of the household. Household size of 3 seems to have the least joviality. A possible reason may be that stress from raising the children may have made them less jovial, this may be a potential area that authorities can examine (with more information) if they hope to raise the mean joviality of the group.

Let’s perform a statistical test to determine if there is a difference in the mean joviality

Show the code
ggbetweenstats(
  data = part_info,
  x = householdSize,
  y = joviality,
  plot.type = "box",
  mean.plotting = TRUE,
  xlab = "Household Size",
  ylab = "Joviality"
)

However, at a significance level of 5%, since the p value is greater than 0.05, we cannot reject the null hypothesis that the mean joviality of residents of different household size are the same.

3.1.8 Mean joviality vs have kids

Show the code
tooltip <- function(y, ymax, accuracy = .01) {
  mean <- scales::number(y, accuracy = accuracy)
  sem <- scales::number(ymax - y, accuracy = accuracy)
  havekids <- unique(part_info$haveKids)
  paste("Have Kids:", havekids, "<br>",
        "Mean Joviality:", mean, "+/-", sem)
}

gg_point <- ggplot(data=part_info, 
                   aes(x = haveKids),
) +
  stat_summary(aes(y = joviality, 
                   tooltip = after_stat(  
                     tooltip(y, ymax))),  
    fun.data = "mean_se", 
    geom = GeomInteractiveCol,  
    fill = "light blue"
  ) +
  stat_summary(aes(y = joviality),
    fun.data = mean_se,
    geom = "errorbar", width = 0.2, size = 0.2
  ) +
  labs(
       x = "Have Kids",
       y = "Joviality")

girafe(ggobj = gg_point,
       width_svg = 8,
       height_svg = 8*0.618)

From the chart above, we observe that mean joviality is lower for residents who have kids, this correspond to our previous analysis of mean joviality vs household size where household with 3 residents has the lowest mean joviality which may be caused by additional stress of taking care of kids

Let’s perform a statistical test to determine if there is a difference in the mean joviality

Show the code
ggbetweenstats(
  data = part_info,
  x = haveKids,
  y = joviality,
  plot.type = "box",
  pairwise.comparisons = TRUE,
  mean.plotting = TRUE,
  xlab = "Having Kids",
  ylab = "Joviality"
)

At a significance level of 5%, since the p value is greater than 0.05, we cannot reject the null hypothesis that the mean joviality of resident who have and do not have kids are the same. There is insufficient evidence to suggest that there is a difference in mean joviality.

3.1.9 Mean joviality vs age

Show the code
age_joviality <- part_info %>% 
  group_by(age) %>% 
  summarize(mean_joviality = mean(joviality))

ggplotly(ggplot(age_joviality, aes(x = age, y = mean_joviality)) +
  geom_smooth(method = "lm", se = FALSE) + geom_point() +
  labs(x = "Age", y = "Mean Joviality") +
  ggtitle("Mean Joviality against Age"))

From the point plot alone, we do not see a distinct trend with age and mean joviality except that resident of age 53 has the lowest mean joviality and resident of age 59 has the highest mean joviality. When we overlay a smoothed line plot, we can see that mean joviality decreases with increasing age.

Let’s perform a statistical test to determine if there is a difference in the mean joviality

Show the code
ggscatterstats(
  data = age_joviality,
  x = age,
  y = mean_joviality,
  marginal = FALSE,
  )+labs(x = "Age", y = "Mean Joviality")

At a significance level of 5%, since the p value is greater than 0.05, we cannot reject the null hypothesis that the mean joviality of resident is different across age group

Further examination
Show the code
# Add a column to indicate outliers using IQR method
is_outlier <- function(x) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = TRUE)
  H <- 1.5 * IQR(x, na.rm = TRUE)
  x < (qnt[1] - H) | x > (qnt[2] + H)
}

# Create data subset for age 53
age53_joviality <- subset(part_info, age == 53)

# Create outlier column using is_outlier function
age53_joviality$outlier <- is_outlier(age53_joviality$joviality)

# Create ggplot box plot with outlier points colored red
plot53 <- ggplot(data = age53_joviality, aes(x = 1, y = joviality)) +
  geom_boxplot() +
  geom_point(aes(x = jitter(1, factor = 0.3), y = joviality, color = outlier)) +
  scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red")) +
  theme_bw() +
  ggtitle("Joviality at Age 53") +
  xlab("") +
  ylab("Joviality")

# Convert ggplot object to plotly
plotly_object <- ggplotly(plot53)


# Create data subset for age 59
age59_joviality <- subset(part_info, age == 59)

# Create outlier column using is_outlier function
age59_joviality$outlier <- is_outlier(age59_joviality$joviality)

# Create ggplot box plot with outlier points colored red
plot59 <- ggplot(data = age59_joviality, aes(x = 1, y = joviality)) +
  geom_boxplot() +
  geom_point(aes(x = jitter(1, factor = 0.3), y = joviality, color = outlier)) +
  scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red")) +
  theme_bw() +
  ggtitle("Joviality at Age 59") +
  xlab("") +
  ylab("Joviality")
# subplot for both plots
subplot(plot53, plot59, nrows = 1, titleY = TRUE, titleX = TRUE, margin = 0.1 ) %>%
  layout(title = 'Further checking',
         plot_bgcolor='#e5ecf6', 
         xaxis = list( 
           zerolinecolor = '#ffff', 
           zerolinewidth = 2, 
           gridcolor = 'ffff'), 
         yaxis = list( 
           zerolinecolor = '#ffff', 
           zerolinewidth = 2, 
           gridcolor = 'ffff')) %>%
  layout(annotations = list(
    list(
      x = 0.25,  
      y = 1.0,  
      text = "Distribution of Age 53 Participants' Joviality",  
      xref = "paper",  
      yref = "paper",  
      xanchor = "center",  
      yanchor = "bottom",  
      showarrow = FALSE 
    ),
    list(
      x = 0.75,  
      y = 1.0,  
      text = "Distribution of Age 59 Participants' Joviality",  
      xref = "paper",  
      yref = "paper",  
      xanchor = "center",  
      yanchor = "bottom",  
      showarrow = FALSE 
    )
  ))

We can observe that joviality for age 53 residents are more concentrated to below 0.5 while joviality for age 59 residents are more evenly distributed across the axis. This confirms our expectation that age 53 residents may indeed be generally not jovial while for age 59 residents, there is no such general concentration of joviality observed.

3.2.0 Mean Joviality vs Interest group

Show the code
# Aggregate joviality by interest group
joviality_interest <- aggregate(joviality ~ interestGroup, data = part_info, mean)

# Plot mean joviality by interest group
ggplot(data = joviality_interest, aes(x = interestGroup, y = joviality)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = round(joviality, 2)), vjust = -0.5) +
  labs(x = "Interest Group", y = "Mean Joviality", 
       title = "Mean Joviality by Interest Group")

Interest group E provides the highest mean joviality. If officials are looking to improve joviality among residents, they may consider encouraging participation in interest group E through ways such as subsidizing group E membership, etc

Let’s perform a statistical test to determine if there is a difference in the mean joviality

Show the code
ggbetweenstats(
  data = part_info,
  x = interestGroup,
  y = joviality,
  plot.type = "box",
  mean.plotting = TRUE,
  xlab = "Interest Group",
  ylab = "Joviality")+
  scale_color_brewer(palette = "Set3")

At a significance level of 5%, since the p value is greater than 0.05, we cannot reject the null hypothesis that the mean joviality of resident who participate in different interest groups are the same.

Note

We use scale_color_brewer() as default palette only has 8 colors. We can define out own colors as well to be used in palette

3.2 Exploring financial data

Next we move on to explore variables in the financial data. Since every participant can have multiple entries. We will explore the data by grouping the entries according the participant’s Id and the category

3.21 Sum of residents expenditure by category

Show the code
# Aggregate financial data by participant
financial_data_agg <- finance %>%
  group_by(participantId,category) %>%
  summarize(total = sum(amount), .groups = "drop")

# Financial summary
expenses_summary <- financial_data_agg %>%
  group_by(category) %>%
  summarize(total = sum(total))

# Bar chart of expenses by category
expenses_plot <- ggplot(expenses_summary, aes(x = category, y = total)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(x = "Expense Category", y = "Total Amount Spent", title = "Expenses by Category") +
  theme_minimal()
ggplotly(expenses_plot)

In the chart above we realize wage is counted as part of expenditure We will remove wage since it is not exactly an expense that we will be looking at.

Expenses by category (removing wage)

Show the code
# Aggregate financial data by participant
financial_data_agg <- finance %>%
   filter(category != "Wage") %>%
  group_by(participantId, category) %>%
  summarize(total = sum(amount), .groups = "drop")

# Financial summary
expenses_summary <- financial_data_agg %>%
  group_by(category) %>%
  summarize(total = sum(total))

# Bar chart of expenses by category
expenses_plot1 <- ggplot(expenses_summary, aes(x = category, y = total)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(x = "Expense Category", y = "Total Amount Spent", title = "Expenses by Category") +
  theme_minimal()
ggplotly(expenses_plot1)

From the chart above we realize that shelter accounts for the main category for expenditure. This is followed by recreation, food and education. Rent adjustment is on the negative end which may be a indication that the landlord in the city has overall lowered their rents.

3.2.2 Total wage against month

Next we will examine the wages over the months across the time stamp period.

Show the code
# Convert the timestamp column to a datetime object
finance$timestamp <- as.POSIXct(finance$timestamp)

# Create a line chart of wages per month
wages_per_month <- finance %>%
  filter(category == "Wage") %>%
  group_by(month = floor_date(timestamp, "month")) %>%
  summarize(total_wages = sum(amount)) %>%
  ggplot(aes(x = month, y = total_wages)) +
  geom_line() +
  labs(x = "Month", y = "Total Wages", title = "Total Wages against Month") +
  scale_x_datetime(date_labels = "%b %Y")

# Display the plot
print(wages_per_month)

From the above chart, we realize that in March, total wage is abnormally high. This may be an anomaly that that may require further information to examine. For now, we will retain it as it is.

3.2.3 Average Amount by Category over Time

Show the code
# Aggregate financial data by category and timestamp
financial_data_agg <- finance %>%
  filter(category != "Wage") %>%
  group_by(category, timestamp) %>%
  summarize(avg_amount = mean(amount))

# Line chart of average amount by timestamp, colored by category
line_chart <- ggplot(financial_data_agg, aes(x = timestamp, y = avg_amount, color = category)) +
  geom_line() +
  labs(x = "Timestamp", y = "Average Amount", title = "Average Amount by Category over Time") +
  theme_minimal()

# Display the chart
ggplotly(line_chart)

From the above chart, we observe that mapping average amount for category by days in the time stamp is not visually pleasing to see any patterns. But still if we zoom in, we can get some information from it.We can see that there are some major fluctuation in shelter amount and rent adjustment in march and April. Education amount is incurred on first day of the month and that if we zoom in. We can see that recreation and food demonstrates a regular pattern as shown in the pic below

Lets try to map above information in terms of months

3.2.4 Average Amount Spent by Category per Month

Show the code
# Extract month from timestamp
financial_data_agg <- finance %>%
  filter(category != "Wage") %>%
  mutate(month = format(timestamp, "%Y-%m")) %>%
  group_by(participantId, category, month) %>%
  summarize(total = sum(amount), .groups = "drop")

# Calculate average amount spent by category per month
category_month_avg <- financial_data_agg %>%
  group_by(category, month) %>%
  summarize(avg_amount = mean(total))

# Create bar chart
category_month_plot <- ggplot(category_month_avg, aes(x = month, y = avg_amount, fill = category)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Month", y = "Average Amount Spent", title = "Average Amount Spent by Category per Month") +
  theme(axis.text.x = element_text(angle = 45,
                                   vjust = 0.5,
                                   hjust = 0.5))
# Plot interactive chart
ggplotly(category_month_plot)

From the chart above we observe that the amount spent on average per category is indeed in the following order with education < food < recreation < shelter. We can also observe some anomalies in March and April. There is an exceeding large expenditure on shelter and recreation in march and in April the rent adjustment increased exceptionally as well. The expenditure in shelter may have being transferred to increase rent adjustment.

Note

Using theme minimal will result in overlapping of x-axis. I use theme here so that angle of x-axis label can be adjusted 45 degrees

Note

Since not everyone in the city is a landlord / tenant. This chart only serve as a benchmark of average amount city resident spent.

3.3 Merged visualization

Now lets see some visualizations after we merge these 2 files together using the code below

Show the code
financial_journal <- finance %>%
  mutate(date = as.Date(timestamp))

# Aggregate financial data by participantId and category
agg_financial <- financial_journal %>%
  group_by(participantId, category) %>%
  summarize(total_spent = sum(amount), .groups = "drop")

# Merge demographic data with aggregated financial data
merged_data <- part_info %>%
  left_join(agg_financial, by = "participantId")

We can see the residents’ total expenditures in each category and their respective wage using the below code.

Show the code
DT::datatable(merged_data[, c("participantId", "category", "total_spent")], class = 'compact')

Lets move on to see some EDAs

3.3.1 Wage agaist joviality

Show the code
# Subset merged_data to only include rows where category is "wage"
wage_data <- merged_data %>% filter(category == "Wage")

ggplotly(ggplot(wage_data, aes(x = total_spent, y = joviality)) +
  geom_smooth() +
  labs(title = "Wage vs. Joviality", x = "Wage", y = "Joviality")+ geom_point())

Using the smooth trend line, we can observe that joviality generally decrease with increasing wage. This may be because a higher wage may mean more responsibility at work and hence greater amount of stress which lead to lower joviality

Let’s perform a statistical test to determine if there is a correlation between joviality and wage

ggscatterstats(
  data = wage_data,
  x = total_spent,
  y = joviality,
  marginal = FALSE,
  ) + labs(x = "Wage", y = "Joviality")

At 5% significance level, since the p value is smaller than 0.05, we reject the null hypothesis that there is no correlation between joviality and wage

3.3.2 Wage against education level

Show the code
wage_data <- merged_data %>% filter(category == "Wage")

ggplot(wage_data, aes(x = educationLevel, y = total_spent)) +
  geom_point() +
  geom_boxplot(alpha = 0.5, width = 0.2, outlier.size = 1) +
  labs(x = "Education Level", y = "Wage")

From the chart above we can observe increasing education level of residents has increasing level of wage

3.3.3 Wage against interest group

Show the code
wage_data <- merged_data %>% filter(category == "Wage")

ggplot(wage_data, aes(x = interestGroup, y = total_spent)) +
  geom_point(alpha = 0.5, width = 0.2) +
  geom_boxplot(alpha = 0.5, width = 0.2, outlier.size = 1) +
  labs(x = "Interest Group", y = "Wage")

From the above chart, wage distribution across interest group seems to be well spread. But Interest group I seems to have attracted on one of the largest wage resident in the city.

Let’s perform a statistical test to determine if there is a correlation between joviality and wage

model <- lm(total_spent ~ interestGroup + educationLevel, data = wage_data)
model

Call:
lm(formula = total_spent ~ interestGroup + educationLevel, data = wage_data)

Coefficients:
                      (Intercept)                     interestGroupB  
                         35276.37                           -2849.09  
                   interestGroupC                     interestGroupD  
                         -2936.16                             583.35  
                   interestGroupE                     interestGroupF  
                         -3019.36                           -5420.24  
                   interestGroupG                     interestGroupH  
                         -4753.63                            -993.26  
                   interestGroupI                     interestGroupJ  
                            58.03                           -2917.00  
educationLevelHighSchoolOrCollege            educationLevelBachelors  
                          5816.44                           28023.60  
           educationLevelGraduate  
                         42627.88  
check_collinearity(model)
# Check for Multicollinearity

Low Correlation

           Term  VIF   VIF 95% CI Increased SE Tolerance Tolerance 95% CI
  interestGroup 1.04 [1.01, 1.22]         1.02      0.96     [0.82, 0.99]
 educationLevel 1.04 [1.01, 1.22]         1.02      0.96     [0.82, 0.99]
Show the code
check_c <- check_collinearity(model)
plot(check_c)

From the results, we can determine that there is a low correlation between education level and interest group attended against wage

3.3.4 Percentage of spending within each category for each group

Show the code
spending_by_kids <- merged_data %>%
  filter(category != "Wage") %>%
  group_by(haveKids, category) %>%
  summarise(total_spending = sum(total_spent)) %>%
  mutate(percentage = total_spending/sum(total_spending)*100)

# Convert haveKids to a factor for easier plotting
spending_by_kids$haveKids <- as.factor(spending_by_kids$haveKids)

spending_by_kids_plot <- ggplot(spending_by_kids, aes(x = haveKids, y = percentage, fill = category)) +
  geom_bar(position = "stack", stat = "identity") +
  scale_fill_viridis_d() +
  theme_minimal() +
  labs(title = "Percentage of Spending by Category and Household Type",
       x = "Household Type",
       y = "Percentage of Spending") +
  guides(fill = guide_legend(title = "Spending Category"))

ggplotly(spending_by_kids_plot)

We can observe that those who do not spend on education at all have no kids while household which do have kids has greater spending on shelter and food and lower on recreation.

3.3.5 Spending category by education level

Show the code
spending_data <- merged_data %>% filter(category != "Wage")

ggplot(spending_data, aes(x = category, y = total_spent, fill = category)) +
  geom_col(position = "dodge") +
  facet_wrap(~educationLevel, ncol = 2) +
  scale_fill_discrete(name = "Spending Category") +
  labs(x = "Spending Category", y = "Amount")

From the chart we can observe that, with the exception of rent adjustment, spending pattern by category across different education level is generally the same. High school educated and bachelors educated has the highest positive rent adjustment which may indicate that there are more landlords from these 2 education backgrounds

3.3.6 Recreation spending vs joviality

Show the code
recreation_data <- merged_data %>%
  filter(category == "Recreation")

# Create scatter plot of joviality vs recreation spending
ggplot(recreation_data, aes(x = total_spent, y = joviality)) +
  geom_smooth() +
  labs(x = "Recreation Spending", y = "Joviality") + geom_point()

From the above observation, it seems that increasing spending would increase joviality until a saturation is reached at around $10000. Note that the chart is plotted on a negative x-axis as spending is accounted for as negative value.

Let’s perform a statistical test to determine if there is a correlation bewteen joviality and recreation spending

Show the code
ggscatterstats(
  data = recreation_data,
  x = total_spent,
  y = joviality,
  marginal = FALSE,
  ) + labs(x = "Recreation", y = "Joviality")

At 5% significance level, since the p value is smaller than 0.05, we reject the null hypothesis that there is no correlation between joviality and recreation expenses

4 Conclusion

In conclusion, I believe that the authority there are a few points in which the they can look at when allocating the grant.

  • Greater funds should be channeled to support the education of residents given that it is the lowest expenditure. Currently education is only observed in residents who has kids and I assume that this expenditure is on the kids only. Perhaps the authorities can look at sponsoring adults continuous learning as well.

  • Mean joviality increases with increase increasing education and that this may be a reason to further support fundings for education purposes.

  • Joviality has an inverse relationship with wage and the manpower authority should perhaps examine the current laws and manpower landscape to determine if employees are properly treated in workplace

  • The mean age of residents having kids is 38 and the percentage of residents who have kids decreases as their education level increase. Targeted family development measures should be put in place if authorities seeks to reduce this age and encourage fertility rate among certain education group level

  • Shelter accounts for the largest expenditure in the category and residents who have kids spends on a larger percentage for shelter. Since shelter is a fundamental necessity, the authorities can perhaps consider rental subsidies and support for qualified residents.

  • Increasing recreation expenses improves joviality to a certain extent. If the city want to improve the happiness of the residents, perhaps they can consider measures such as free tours to parks,etc

To sum up, I have examined some variables and their correlation to each another. I have also provided some areas in which the authority can look at to disburse their city renewal grant and revitalize the community. However, as correlation does not equal causation, the above only serves as a reference to which the authority can look at. Further examination with more in dept data are required to understand the community and solve the problems more efficiently .